How do services and churn correlate?
The goal of this analysis is to find the services that correlate with customer churn. For example, what services do the customers with the highest churn rate use?
- Assumptions of multiple regression:
There must be a linear relationship between the outcome variable and the independent variables.
Multiple regression assumes that the residuals are normally distributed.
Multiple regression assumes that the independent variables are not highly correlated with each other.
The variance of error terms are similar across the values of the independent variables.
Multiple linear regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables.
- We are using Python because it is very effective to use for data analysis. There are also many different packages that can be added to python to significantly increase the functionality, such as pandas and numpy.
- Multiple regression will allow us to analyze each different type of service variable and find out the services and variables that correlate with tenure. An organization would then be able to use this information to try and decrease overall customer churn.
Data preparation will consist of cleaning up bad data, such as replacing nulls, removing duplicated rows, etc. We will also convert categorical variables to numerical so we can run linear regression.
The target variable is Churn. The predictor variables are all variables besides customer demographics, such as city, state, age, gender, location, and job. These are variables that the company cannot control and can not change. The other variables are possible for the company to try to change to increase tenure. We will be using the services variables such as Contract, Payment Methods, and Internet Service.
To prepare the data for analysis, we will be removing unneeded columns, describing the data, types, finding missing information, deleting duplicates, etc.
# Importing packages to be used
import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Importing the data
df = pd.read_csv('Churn Data\churn_clean.csv')
# Information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CaseOrder 10000 non-null int64 1 Customer_id 10000 non-null object 2 Interaction 10000 non-null object 3 UID 10000 non-null object 4 City 10000 non-null object 5 State 10000 non-null object 6 County 10000 non-null object 7 Zip 10000 non-null int64 8 Lat 10000 non-null float64 9 Lng 10000 non-null float64 10 Population 10000 non-null int64 11 Area 10000 non-null object 12 TimeZone 10000 non-null object 13 Job 10000 non-null object 14 Children 10000 non-null int64 15 Age 10000 non-null int64 16 Income 10000 non-null float64 17 Marital 10000 non-null object 18 Gender 10000 non-null object 19 Churn 10000 non-null object 20 Outage_sec_perweek 10000 non-null float64 21 Email 10000 non-null int64 22 Contacts 10000 non-null int64 23 Yearly_equip_failure 10000 non-null int64 24 Techie 10000 non-null object 25 Contract 10000 non-null object 26 Port_modem 10000 non-null object 27 Tablet 10000 non-null object 28 InternetService 10000 non-null object 29 Phone 10000 non-null object 30 Multiple 10000 non-null object 31 OnlineSecurity 10000 non-null object 32 OnlineBackup 10000 non-null object 33 DeviceProtection 10000 non-null object 34 TechSupport 10000 non-null object 35 StreamingTV 10000 non-null object 36 StreamingMovies 10000 non-null object 37 PaperlessBilling 10000 non-null object 38 PaymentMethod 10000 non-null object 39 Tenure 10000 non-null float64 40 MonthlyCharge 10000 non-null float64 41 Bandwidth_GB_Year 10000 non-null float64 42 Item1 10000 non-null int64 43 Item2 10000 non-null int64 44 Item3 10000 non-null int64 45 Item4 10000 non-null int64 46 Item5 10000 non-null int64 47 Item6 10000 non-null int64 48 Item7 10000 non-null int64 49 Item8 10000 non-null int64 dtypes: float64(7), int64(16), object(27) memory usage: 3.8+ MB
# describing the data
df.describe()
| CaseOrder | Zip | Lat | Lng | Population | Children | Age | Income | Outage_sec_perweek | ... | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.0000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 5000.50000 | 49153.319600 | 38.757567 | -90.782536 | 9756.562400 | 2.0877 | 53.078400 | 39806.926771 | 10.001848 | 12.016000 | ... | 172.624816 | 3392.341550 | 3.490800 | 3.505100 | 3.487000 | 3.497500 | 3.492900 | 3.497300 | 3.509500 | 3.495600 |
| std | 2886.89568 | 27532.196108 | 5.437389 | 15.156142 | 14432.698671 | 2.1472 | 20.698882 | 28199.916702 | 2.976019 | 3.025898 | ... | 42.943094 | 2185.294852 | 1.037797 | 1.034641 | 1.027977 | 1.025816 | 1.024819 | 1.033586 | 1.028502 | 1.028633 |
| min | 1.00000 | 601.000000 | 17.966120 | -171.688150 | 0.000000 | 0.0000 | 18.000000 | 348.670000 | 0.099747 | 1.000000 | ... | 79.978860 | 155.506715 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 2500.75000 | 26292.500000 | 35.341828 | -97.082813 | 738.000000 | 0.0000 | 35.000000 | 19224.717500 | 8.018214 | 10.000000 | ... | 139.979239 | 1236.470827 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 |
| 50% | 5000.50000 | 48869.500000 | 39.395800 | -87.918800 | 2910.500000 | 1.0000 | 53.000000 | 33170.605000 | 10.018560 | 12.000000 | ... | 167.484700 | 3279.536903 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 3.000000 |
| 75% | 7500.25000 | 71866.500000 | 42.106908 | -80.088745 | 13168.000000 | 3.0000 | 71.000000 | 53246.170000 | 11.969485 | 14.000000 | ... | 200.734725 | 5586.141369 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
| max | 10000.00000 | 99929.000000 | 70.640660 | -65.667850 | 111850.000000 | 10.0000 | 89.000000 | 258900.700000 | 21.207230 | 23.000000 | ... | 290.160419 | 7158.981530 | 7.000000 | 7.000000 | 8.000000 | 7.000000 | 7.000000 | 8.000000 | 7.000000 | 8.000000 |
8 rows × 23 columns
# Checking for null values
df.isna().any()
CaseOrder False Customer_id False Interaction False UID False City False State False County False Zip False Lat False Lng False Population False Area False TimeZone False Job False Children False Age False Income False Marital False Gender False Churn False Outage_sec_perweek False Email False Contacts False Yearly_equip_failure False Techie False Contract False Port_modem False Tablet False InternetService False Phone False Multiple False OnlineSecurity False OnlineBackup False DeviceProtection False TechSupport False StreamingTV False StreamingMovies False PaperlessBilling False PaymentMethod False Tenure False MonthlyCharge False Bandwidth_GB_Year False Item1 False Item2 False Item3 False Item4 False Item5 False Item6 False Item7 False Item8 False dtype: bool
# Checking for duplicates
df.duplicated().any()
False
# Removing the customer demograpic variables that we dont need
df = df.drop(['CaseOrder','Customer_id','Interaction','UID','City','State','County','Zip','Lat','Lng',
"Area",'TimeZone','Job'], axis=1)
df.head()
| Population | Children | Age | Income | Marital | Gender | Churn | Outage_sec_perweek | Contacts | ... | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | 0 | 68 | 28561.99 | Widowed | Male | No | 7.978323 | 10 | 0 | ... | 172.455519 | 904.536110 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 |
| 1 | 10446 | 1 | 27 | 21704.77 | Married | Female | Yes | 11.699080 | 12 | 0 | ... | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 |
| 2 | 3735 | 4 | 50 | 9609.57 | Widowed | Female | No | 10.752800 | 9 | 0 | ... | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 |
| 3 | 13863 | 1 | 48 | 18925.23 | Married | Male | No | 14.913540 | 15 | 2 | ... | 119.956840 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 |
| 4 | 11352 | 0 | 83 | 40074.19 | Separated | Male | Yes | 8.147417 | 16 | 2 | ... | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 |
5 rows × 37 columns
- Univariate and Bivariate visualizations
# Churn Data
churn_data = df['Churn'].value_counts()
churn_data.plot(kind='bar')
plt.title('Churn Rate')
plt.xlabel('Churn')
plt.ylabel('People')
plt.show()
# Payment Data
payment_data = df['PaymentMethod'].value_counts()
payment_data.plot(kind='bar')
plt.title('Payment Method')
plt.xlabel('Method')
plt.ylabel('People')
plt.show()
# Contract Data
payment_data = df['Contract'].value_counts()
payment_data.plot(kind='bar')
plt.title('Contract')
plt.xlabel('Contract Type')
plt.ylabel('People')
plt.show()
# Tenure Data
tenure_data = df['Tenure']
tenure_data.plot(kind='hist')
plt.title('Tenure Data')
plt.xlabel('Tenure Length')
plt.ylabel('Amount of people')
plt.show()
# Tenure Data
tenure_data = df['InternetService'].value_counts()
tenure_data.plot(kind='bar')
plt.title('Internet Service')
plt.xlabel('Internet Type')
plt.ylabel('Amount of people')
plt.show()
# Internet Service to Churn
sns.catplot(x='Churn', hue='InternetService', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d76dd730>
# Contract to Churn
sns.catplot(x='Churn', hue='Contract', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d7738820>
# Payment Method to Churn
sns.catplot(x='Churn', hue='PaymentMethod', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d778ebb0>
- Export Data
# export file to csv
df.to_csv('cleaned_data.csv', index = False)
- Initial regression
# Convert categorical variables to numerical
df = pd.get_dummies(df)
#Prepare variables
observations = len(df)
variables = df.columns[:-1]
X = df.iloc[:]
y = df['Churn_Yes']
import statsmodels.api as sm
Xc = sm.add_constant(X)
linear_regression = sm.OLS(y,Xc)
fitted_model = linear_regression.fit()
fitted_model.summary()
| Dep. Variable: | Churn_Yes | R-squared: | 1.000 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 1.000 |
| Method: | Least Squares | F-statistic: | 4.025e+27 |
| Date: | Mon, 10 May 2021 | Prob (F-statistic): | 0.00 |
| Time: | 16:29:08 | Log-Likelihood: | 2.8481e+05 |
| No. Observations: | 10000 | AIC: | -5.695e+05 |
| Df Residuals: | 9954 | BIC: | -5.692e+05 |
| Df Model: | 45 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 0.0559 | 3.52e-15 | 1.59e+13 | 0.000 | 0.056 | 0.056 |
| Population | -1.992e-20 | 7.2e-20 | -0.277 | 0.782 | -1.61e-19 | 1.21e-19 |
| Children | -1.011e-14 | 3.66e-15 | -2.760 | 0.006 | -1.73e-14 | -2.93e-15 |
| Age | -7.016e-16 | 3.9e-16 | -1.799 | 0.072 | -1.47e-15 | 6.27e-17 |
| Income | -1.778e-18 | 3.69e-20 | -48.236 | 0.000 | -1.85e-18 | -1.71e-18 |
| Outage_sec_perweek | -3.414e-16 | 3.49e-16 | -0.977 | 0.329 | -1.03e-15 | 3.43e-16 |
| 6.164e-16 | 3.43e-16 | 1.794 | 0.073 | -5.7e-17 | 1.29e-15 | |
| Contacts | -5.132e-16 | 1.05e-15 | -0.488 | 0.625 | -2.57e-15 | 1.55e-15 |
| Yearly_equip_failure | -1.762e-15 | 1.63e-15 | -1.078 | 0.281 | -4.97e-15 | 1.44e-15 |
| Tenure | 1.643e-14 | 9.67e-15 | 1.699 | 0.089 | -2.53e-15 | 3.54e-14 |
| MonthlyCharge | 6.154e-16 | 3.61e-16 | 1.706 | 0.088 | -9.18e-17 | 1.32e-15 |
| Bandwidth_GB_Year | -2.033e-16 | 1.18e-16 | -1.723 | 0.085 | -4.35e-16 | 2.8e-17 |
| Item1 | 1.759e-15 | 1.49e-15 | 1.181 | 0.238 | -1.16e-15 | 4.68e-15 |
| Item2 | 3.077e-16 | 1.39e-15 | 0.221 | 0.825 | -2.43e-15 | 3.04e-15 |
| Item3 | -1.193e-15 | 1.28e-15 | -0.933 | 0.351 | -3.7e-15 | 1.31e-15 |
| Item4 | -5.86e-16 | 1.14e-15 | -0.512 | 0.608 | -2.83e-15 | 1.66e-15 |
| Item5 | -1.545e-15 | 1.19e-15 | -1.301 | 0.193 | -3.87e-15 | 7.83e-16 |
| Item6 | 3.274e-16 | 1.22e-15 | 0.268 | 0.789 | -2.07e-15 | 2.72e-15 |
| Item7 | -7.441e-16 | 1.16e-15 | -0.643 | 0.520 | -3.01e-15 | 1.52e-15 |
| Item8 | -3.967e-16 | 1.1e-15 | -0.360 | 0.719 | -2.55e-15 | 1.76e-15 |
| Marital_Divorced | 0.0112 | 2.16e-15 | 5.16e+12 | 0.000 | 0.011 | 0.011 |
| Marital_Married | 0.0112 | 2.24e-15 | 4.99e+12 | 0.000 | 0.011 | 0.011 |
| Marital_Never Married | 0.0112 | 2.21e-15 | 5.05e+12 | 0.000 | 0.011 | 0.011 |
| Marital_Separated | 0.0112 | 2.18e-15 | 5.12e+12 | 0.000 | 0.011 | 0.011 |
| Marital_Widowed | 0.0112 | 2.18e-15 | 5.13e+12 | 0.000 | 0.011 | 0.011 |
| Gender_Female | 0.0186 | 2.79e-15 | 6.68e+12 | 0.000 | 0.019 | 0.019 |
| Gender_Male | 0.0186 | 7.22e-15 | 2.58e+12 | 0.000 | 0.019 | 0.019 |
| Gender_Nonbinary | 0.0186 | 5.89e-15 | 3.16e+12 | 0.000 | 0.019 | 0.019 |
| Churn_No | -0.4721 | 2.32e-15 | -2.04e+14 | 0.000 | -0.472 | -0.472 |
| Churn_Yes | 0.5279 | 2.5e-15 | 2.11e+14 | 0.000 | 0.528 | 0.528 |
| Techie_No | 0.0279 | 2.21e-15 | 1.26e+13 | 0.000 | 0.028 | 0.028 |
| Techie_Yes | 0.0279 | 2.28e-15 | 1.22e+13 | 0.000 | 0.028 | 0.028 |
| Contract_Month-to-month | 0.0186 | 1.85e-15 | 1.01e+13 | 0.000 | 0.019 | 0.019 |
| Contract_One year | 0.0186 | 2.13e-15 | 8.76e+12 | 0.000 | 0.019 | 0.019 |
| Contract_Two Year | 0.0186 | 2.07e-15 | 9e+12 | 0.000 | 0.019 | 0.019 |
| Port_modem_No | 0.0279 | 2.05e-15 | 1.36e+13 | 0.000 | 0.028 | 0.028 |
| Port_modem_Yes | 0.0279 | 2.03e-15 | 1.37e+13 | 0.000 | 0.028 | 0.028 |
| Tablet_No | 0.0279 | 2.09e-15 | 1.34e+13 | 0.000 | 0.028 | 0.028 |
| Tablet_Yes | 0.0279 | 2.1e-15 | 1.33e+13 | 0.000 | 0.028 | 0.028 |
| InternetService_DSL | 0.0186 | 3.42e-14 | 5.45e+11 | 0.000 | 0.019 | 0.019 |
| InternetService_Fiber Optic | 0.0186 | 2.17e-14 | 8.57e+11 | 0.000 | 0.019 | 0.019 |
| InternetService_None | 0.0186 | 1.05e-14 | 1.77e+12 | 0.000 | 0.019 | 0.019 |
| Phone_No | 0.0279 | 2.58e-15 | 1.08e+13 | 0.000 | 0.028 | 0.028 |
| Phone_Yes | 0.0279 | 2.43e-15 | 1.15e+13 | 0.000 | 0.028 | 0.028 |
| Multiple_No | 0.0279 | 2.85e-15 | 9.79e+12 | 0.000 | 0.028 | 0.028 |
| Multiple_Yes | 0.0279 | 3.32e-15 | 8.41e+12 | 0.000 | 0.028 | 0.028 |
| OnlineSecurity_No | 0.0279 | 3.28e-15 | 8.52e+12 | 0.000 | 0.028 | 0.028 |
| OnlineSecurity_Yes | 0.0279 | 5.49e-15 | 5.09e+12 | 0.000 | 0.028 | 0.028 |
| OnlineBackup_No | 0.0279 | 1.36e-15 | 2.05e+13 | 0.000 | 0.028 | 0.028 |
| OnlineBackup_Yes | 0.0279 | 4e-15 | 6.99e+12 | 0.000 | 0.028 | 0.028 |
| DeviceProtection_No | 0.0279 | 2.16e-15 | 1.29e+13 | 0.000 | 0.028 | 0.028 |
| DeviceProtection_Yes | 0.0279 | 4.64e-15 | 6.03e+12 | 0.000 | 0.028 | 0.028 |
| TechSupport_No | 0.0279 | 3.29e-15 | 8.5e+12 | 0.000 | 0.028 | 0.028 |
| TechSupport_Yes | 0.0279 | 2.4e-15 | 1.17e+13 | 0.000 | 0.028 | 0.028 |
| StreamingTV_No | 0.0279 | 5.44e-15 | 5.14e+12 | 0.000 | 0.028 | 0.028 |
| StreamingTV_Yes | 0.0279 | 8.37e-15 | 3.34e+12 | 0.000 | 0.028 | 0.028 |
| StreamingMovies_No | 0.0279 | 3.28e-15 | 8.53e+12 | 0.000 | 0.028 | 0.028 |
| StreamingMovies_Yes | 0.0279 | 6.41e-15 | 4.35e+12 | 0.000 | 0.028 | 0.028 |
| PaperlessBilling_No | 0.0279 | 2.05e-15 | 1.37e+13 | 0.000 | 0.028 | 0.028 |
| PaperlessBilling_Yes | 0.0279 | 2.06e-15 | 1.36e+13 | 0.000 | 0.028 | 0.028 |
| PaymentMethod_Bank Transfer(automatic) | 0.0140 | 2.09e-15 | 6.68e+12 | 0.000 | 0.014 | 0.014 |
| PaymentMethod_Credit Card (automatic) | 0.0140 | 2.12e-15 | 6.58e+12 | 0.000 | 0.014 | 0.014 |
| PaymentMethod_Electronic Check | 0.0140 | 1.86e-15 | 7.51e+12 | 0.000 | 0.014 | 0.014 |
| PaymentMethod_Mailed Check | 0.0140 | 2.05e-15 | 6.8e+12 | 0.000 | 0.014 | 0.014 |
| Omnibus: | 1598.478 | Durbin-Watson: | 0.700 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2831.137 |
| Skew: | 1.031 | Prob(JB): | 0.00 |
| Kurtosis: | 4.593 | Cond. No. | 4.42e+16 |
- Reducing the dataset
We can use correlation and heatmap to figure out the most important variables for regression
df.corr()
| Population | Children | Age | Income | Outage_sec_perweek | Contacts | Yearly_equip_failure | Tenure | MonthlyCharge | ... | StreamingTV_No | StreamingTV_Yes | StreamingMovies_No | StreamingMovies_Yes | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank Transfer(automatic) | PaymentMethod_Credit Card (automatic) | PaymentMethod_Electronic Check | PaymentMethod_Mailed Check | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Population | 1.000000 | -0.005877 | 0.010538 | -0.008639 | 0.005483 | 0.017962 | 0.004019 | -0.004483 | -0.003559 | -0.004778 | ... | 0.006590 | -0.006590 | 0.005882 | -0.005882 | -0.008656 | 0.008656 | -0.001749 | 0.011399 | -0.013628 | 0.006077 |
| Children | -0.005877 | 1.000000 | -0.029732 | 0.009942 | 0.001889 | 0.004479 | -0.020776 | 0.007321 | -0.005091 | -0.009781 | ... | 0.007106 | -0.007106 | -0.003461 | 0.003461 | -0.006828 | 0.006828 | -0.005761 | 0.005771 | 0.003441 | -0.003750 |
| Age | 0.010538 | -0.029732 | 1.000000 | -0.004091 | -0.008047 | 0.001588 | 0.015068 | 0.008577 | 0.016979 | 0.010729 | ... | 0.000197 | -0.000197 | -0.010125 | 0.010125 | -0.002247 | 0.002247 | 0.020689 | -0.000372 | 0.000588 | -0.020795 |
| Income | -0.008639 | 0.009942 | -0.004091 | 1.000000 | -0.010011 | -0.009267 | 0.001233 | 0.005423 | 0.002114 | -0.003014 | ... | 0.003324 | -0.003324 | 0.002186 | -0.002186 | 0.012049 | -0.012049 | -0.013111 | -0.005798 | 0.003597 | 0.014535 |
| Outage_sec_perweek | 0.005483 | 0.001889 | -0.008047 | -0.010011 | 1.000000 | 0.003994 | 0.015092 | 0.002909 | 0.002932 | 0.020496 | ... | -0.008952 | 0.008952 | -0.018086 | 0.018086 | 0.011208 | -0.011208 | -0.007203 | 0.011445 | 0.004995 | -0.009557 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PaperlessBilling_Yes | 0.008656 | 0.006828 | 0.002247 | -0.012049 | -0.011208 | -0.012632 | -0.002649 | 0.016603 | -0.000655 | 0.001348 | ... | 0.016353 | -0.016353 | -0.000692 | 0.000692 | -1.000000 | 1.000000 | 0.005811 | 0.001891 | 0.003988 | -0.012078 |
| PaymentMethod_Bank Transfer(automatic) | -0.001749 | -0.005761 | 0.020689 | -0.013111 | -0.007203 | -0.008311 | 0.021861 | -0.009500 | -0.014228 | 0.003266 | ... | -0.004482 | 0.004482 | 0.013449 | -0.013449 | -0.005811 | 0.005811 | 1.000000 | -0.274714 | -0.384229 | -0.291882 |
| PaymentMethod_Credit Card (automatic) | 0.011399 | 0.005771 | -0.000372 | -0.005798 | 0.011445 | 0.017390 | 0.003508 | 0.007344 | 0.011513 | 0.011922 | ... | -0.004575 | 0.004575 | -0.013997 | 0.013997 | -0.001891 | 0.001891 | -0.274714 | 1.000000 | -0.367992 | -0.279547 |
| PaymentMethod_Electronic Check | -0.013628 | 0.003441 | 0.000588 | 0.003597 | 0.004995 | -0.009167 | -0.004548 | 0.016798 | -0.002885 | -0.010610 | ... | 0.007548 | -0.007548 | -0.001004 | 0.001004 | -0.003988 | 0.003988 | -0.384229 | -0.367992 | 1.000000 | -0.390989 |
| PaymentMethod_Mailed Check | 0.006077 | -0.003750 | -0.020795 | 0.014535 | -0.009557 | 0.001759 | -0.019917 | -0.016624 | 0.006218 | -0.002798 | ... | 0.000353 | -0.000353 | 0.001338 | -0.001338 | 0.012078 | -0.012078 | -0.291882 | -0.279547 | -0.390989 | 1.000000 |
63 rows × 63 columns
plt.figure(figsize = (20,20))
sns.heatmap(df.corr(),annot = True)
<AxesSubplot:>
We are finding the values that correlate with Churn_Yes. These values are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, Bandwidth_GB_Year.
- Reduced regression model
#Prepare variables
observations = len(df)
variables = df.columns[:-1]
X = df[['Contract_Month-to-month','Tenure','StreamingTV_Yes','StreamingMovies_Yes','MonthlyCharge','Bandwidth_GB_Year']]
y = df['Churn_Yes']
Xc = sm.add_constant(X)
linear_regression = sm.OLS(y,Xc)
fitted_model = linear_regression.fit()
fitted_model.summary()
| Dep. Variable: | Churn_Yes | R-squared: | 0.474 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.474 |
| Method: | Least Squares | F-statistic: | 1501. |
| Date: | Mon, 10 May 2021 | Prob (F-statistic): | 0.00 |
| Time: | 16:29:40 | Log-Likelihood: | -2796.6 |
| No. Observations: | 10000 | AIC: | 5607. |
| Df Residuals: | 9993 | BIC: | 5658. |
| Df Model: | 6 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.1355 | 0.017 | -7.871 | 0.000 | -0.169 | -0.102 |
| Contract_Month-to-month | 0.2376 | 0.006 | 36.939 | 0.000 | 0.225 | 0.250 |
| Tenure | -0.0254 | 0.001 | -22.677 | 0.000 | -0.028 | -0.023 |
| StreamingTV_Yes | 0.0812 | 0.008 | 9.563 | 0.000 | 0.065 | 0.098 |
| StreamingMovies_Yes | 0.1068 | 0.009 | 11.606 | 0.000 | 0.089 | 0.125 |
| MonthlyCharge | 0.0020 | 0.000 | 16.247 | 0.000 | 0.002 | 0.002 |
| Bandwidth_GB_Year | 0.0002 | 1.36e-05 | 15.579 | 0.000 | 0.000 | 0.000 |
| Omnibus: | 986.974 | Durbin-Watson: | 1.971 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 336.853 |
| Skew: | 0.184 | Prob(JB): | 7.13e-74 |
| Kurtosis: | 2.180 | Cond. No. | 2.34e+04 |
- Explain data analysis process
The variable selection technique was using correlation and heatmaps to figure out the variables that correlate the most with churn.
The model evaluation metrics are shown in the regression results
Plot Residuals:
import matplotlib as mpl
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.model_selection import cross_val_predict, train_test_split
from yellowbrick.regressor import AlphaSelection, PredictionError, ResidualsPlot
mpl.rcParams['figure.figsize'] = (10,5)
#Create Test and Training dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Plot Residuals
model = Ridge()
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()
# Error Model
model = Lasso()
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.poof()
Different coefficiants have different affects on the results of churn. Some have a higher correlation with churn and some have a lower correlation with churn. We found that values that correlate with Churn_Yes. These values are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, Bandwidth_GB_Year. Other variables have a lower correlation with churn or a higher correlation with NO churn.
The limitations of this data analysis is only running one type of model on the dataset, which can cause inacurate results. The dataset is also 10,000 rows, which is a lot, but not nearly enough to encompass all of the people that could be affected by an analysis like this. There could be so many more people and data that just were not included in this data set that could change the results.
A course of action a company could take based on these results is trying to get customers to sign up for other contract lengths rather than month-to-month, as this contract length has a high correlation with churn.
How do services and churn correlate?
The goal of this analysis is to find the services that correlate with customer churn. For example, what services do the customers with the highest churn rate use?
- Assumptions of multiple regression:
Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.
Logistic regression requires the observations to be independent of each other.
Logistic regression requires there to be little or no multicollinearity among the independent variables.
Logistic regression assumes linearity of independent variables and log odds.
Logistic regression typically requires a large sample size.
- We are using Python because it is very effective to use for data analysis. There are also many different packages that can be added to python to significantly increase the functionality, such as pandas and numpy.
- Logical regression will allow us to analyze each different type of service variable and find out the services and variables that correlate with tenure. An organization would then be able to use this information to try and decrease overall customer churn.
Data preparation will consist of cleaning up bad data, such as replacing nulls, removing duplicated rows, etc. We will also convert categorical variables to numerical so we can run linear regression.
The target variable is Churn. The predictor variables are all variables besides customer demographics, such as city, state, age, gender, location, and job. These are variables that the company cannot control and can not change. The other variables are possible for the company to try to change to increase tenure. We will be using the services variables such as Contract, Payment Methods, and Internet Service.
To prepare the data for analysis, we will be removing unneeded columns, describing the data, types, finding missing information, deleting duplicates, etc.
# Importing the data
df = pd.read_csv('Churn Data\churn_clean.csv')
# Information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 50 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CaseOrder 10000 non-null int64 1 Customer_id 10000 non-null object 2 Interaction 10000 non-null object 3 UID 10000 non-null object 4 City 10000 non-null object 5 State 10000 non-null object 6 County 10000 non-null object 7 Zip 10000 non-null int64 8 Lat 10000 non-null float64 9 Lng 10000 non-null float64 10 Population 10000 non-null int64 11 Area 10000 non-null object 12 TimeZone 10000 non-null object 13 Job 10000 non-null object 14 Children 10000 non-null int64 15 Age 10000 non-null int64 16 Income 10000 non-null float64 17 Marital 10000 non-null object 18 Gender 10000 non-null object 19 Churn 10000 non-null object 20 Outage_sec_perweek 10000 non-null float64 21 Email 10000 non-null int64 22 Contacts 10000 non-null int64 23 Yearly_equip_failure 10000 non-null int64 24 Techie 10000 non-null object 25 Contract 10000 non-null object 26 Port_modem 10000 non-null object 27 Tablet 10000 non-null object 28 InternetService 10000 non-null object 29 Phone 10000 non-null object 30 Multiple 10000 non-null object 31 OnlineSecurity 10000 non-null object 32 OnlineBackup 10000 non-null object 33 DeviceProtection 10000 non-null object 34 TechSupport 10000 non-null object 35 StreamingTV 10000 non-null object 36 StreamingMovies 10000 non-null object 37 PaperlessBilling 10000 non-null object 38 PaymentMethod 10000 non-null object 39 Tenure 10000 non-null float64 40 MonthlyCharge 10000 non-null float64 41 Bandwidth_GB_Year 10000 non-null float64 42 Item1 10000 non-null int64 43 Item2 10000 non-null int64 44 Item3 10000 non-null int64 45 Item4 10000 non-null int64 46 Item5 10000 non-null int64 47 Item6 10000 non-null int64 48 Item7 10000 non-null int64 49 Item8 10000 non-null int64 dtypes: float64(7), int64(16), object(27) memory usage: 3.8+ MB
# describing the data
df.describe()
| CaseOrder | Zip | Lat | Lng | Population | Children | Age | Income | Outage_sec_perweek | ... | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.00000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.0000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | ... | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
| mean | 5000.50000 | 49153.319600 | 38.757567 | -90.782536 | 9756.562400 | 2.0877 | 53.078400 | 39806.926771 | 10.001848 | 12.016000 | ... | 172.624816 | 3392.341550 | 3.490800 | 3.505100 | 3.487000 | 3.497500 | 3.492900 | 3.497300 | 3.509500 | 3.495600 |
| std | 2886.89568 | 27532.196108 | 5.437389 | 15.156142 | 14432.698671 | 2.1472 | 20.698882 | 28199.916702 | 2.976019 | 3.025898 | ... | 42.943094 | 2185.294852 | 1.037797 | 1.034641 | 1.027977 | 1.025816 | 1.024819 | 1.033586 | 1.028502 | 1.028633 |
| min | 1.00000 | 601.000000 | 17.966120 | -171.688150 | 0.000000 | 0.0000 | 18.000000 | 348.670000 | 0.099747 | 1.000000 | ... | 79.978860 | 155.506715 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 2500.75000 | 26292.500000 | 35.341828 | -97.082813 | 738.000000 | 0.0000 | 35.000000 | 19224.717500 | 8.018214 | 10.000000 | ... | 139.979239 | 1236.470827 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 |
| 50% | 5000.50000 | 48869.500000 | 39.395800 | -87.918800 | 2910.500000 | 1.0000 | 53.000000 | 33170.605000 | 10.018560 | 12.000000 | ... | 167.484700 | 3279.536903 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 | 3.000000 |
| 75% | 7500.25000 | 71866.500000 | 42.106908 | -80.088745 | 13168.000000 | 3.0000 | 71.000000 | 53246.170000 | 11.969485 | 14.000000 | ... | 200.734725 | 5586.141369 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
| max | 10000.00000 | 99929.000000 | 70.640660 | -65.667850 | 111850.000000 | 10.0000 | 89.000000 | 258900.700000 | 21.207230 | 23.000000 | ... | 290.160419 | 7158.981530 | 7.000000 | 7.000000 | 8.000000 | 7.000000 | 7.000000 | 8.000000 | 7.000000 | 8.000000 |
8 rows × 23 columns
# Checking for null values
df.isna().any()
CaseOrder False Customer_id False Interaction False UID False City False State False County False Zip False Lat False Lng False Population False Area False TimeZone False Job False Children False Age False Income False Marital False Gender False Churn False Outage_sec_perweek False Email False Contacts False Yearly_equip_failure False Techie False Contract False Port_modem False Tablet False InternetService False Phone False Multiple False OnlineSecurity False OnlineBackup False DeviceProtection False TechSupport False StreamingTV False StreamingMovies False PaperlessBilling False PaymentMethod False Tenure False MonthlyCharge False Bandwidth_GB_Year False Item1 False Item2 False Item3 False Item4 False Item5 False Item6 False Item7 False Item8 False dtype: bool
# Checking for duplicates
df.duplicated().any()
False
# Removing the customer demograpic variables that we dont need
df = df.drop(['CaseOrder','Customer_id','Interaction','UID','City','State','County','Zip','Lat','Lng',
"Area",'TimeZone','Job'], axis=1)
df.head()
| Population | Children | Age | Income | Marital | Gender | Churn | Outage_sec_perweek | Contacts | ... | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | 0 | 68 | 28561.99 | Widowed | Male | No | 7.978323 | 10 | 0 | ... | 172.455519 | 904.536110 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 |
| 1 | 10446 | 1 | 27 | 21704.77 | Married | Female | Yes | 11.699080 | 12 | 0 | ... | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 |
| 2 | 3735 | 4 | 50 | 9609.57 | Widowed | Female | No | 10.752800 | 9 | 0 | ... | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 |
| 3 | 13863 | 1 | 48 | 18925.23 | Married | Male | No | 14.913540 | 15 | 2 | ... | 119.956840 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 |
| 4 | 11352 | 0 | 83 | 40074.19 | Separated | Male | Yes | 8.147417 | 16 | 2 | ... | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 |
5 rows × 37 columns
- Univariate and Bivariate visualizations
# Churn Data
churn_data = df['Churn'].value_counts()
churn_data.plot(kind='bar')
plt.title('Churn Rate')
plt.xlabel('Churn')
plt.ylabel('People')
plt.show()
# Payment Data
payment_data = df['PaymentMethod'].value_counts()
payment_data.plot(kind='bar')
plt.title('Payment Method')
plt.xlabel('Method')
plt.ylabel('People')
plt.show()
# Contract Data
payment_data = df['Contract'].value_counts()
payment_data.plot(kind='bar')
plt.title('Contract')
plt.xlabel('Contract Type')
plt.ylabel('People')
plt.show()
# Tenure Data
tenure_data = df['Tenure']
tenure_data.plot(kind='hist')
plt.title('Tenure Data')
plt.xlabel('Tenure Length')
plt.ylabel('Amount of people')
plt.show()
# Tenure Data
tenure_data = df['InternetService'].value_counts()
tenure_data.plot(kind='bar')
plt.title('Internet Service')
plt.xlabel('Internet Type')
plt.ylabel('Amount of people')
plt.show()
# Internet Service to Churn
sns.catplot(x='Churn', hue='InternetService', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d961c220>
# Contract to Churn
sns.catplot(x='Churn', hue='Contract', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d977a9d0>
# Payment Method to Churn
sns.catplot(x='Churn', hue='PaymentMethod', kind = 'count', data=df)
<seaborn.axisgrid.FacetGrid at 0x1f4d95b9760>
# export file to csv
df.to_csv('cleaned_data.csv', index = False)
- Initial regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Convert categorical variables to numerical
df = pd.get_dummies(df)
# Setting features
X_features = df.loc[:,df.columns != 'Churn_Yes']
y_features = df['Churn_Yes']
# Training and testing data
X_train, X_test, y_train, y_Test = train_test_split(X_features, y_features, test_size=0.80, random_state=234)
# Setting model
logmodel = LogisticRegression(solver='liblinear', random_state=0)
# Fitting data
logmodel.fit(X_train, y_train)
LogisticRegression(random_state=0, solver='liblinear')
# Model coefficents
logmodel.coef_
array([[-6.29441766e-06, -4.29107288e-02, -3.18488969e-03,
-2.04789659e-06, -1.15690180e-01, -1.09875969e-01,
-1.61829354e-02, -5.95445177e-03, -1.46938163e-01,
2.62288723e-02, 9.07172309e-04, -4.99427688e-02,
-3.47774534e-02, -4.42437849e-02, -4.20376634e-02,
-5.73861264e-02, -5.36256314e-02, -5.07427450e-02,
-4.56557763e-02, -2.04253196e-03, -3.93340419e-03,
-3.76755360e-03, 1.33946418e-04, -4.08945724e-03,
-1.61335001e-02, 3.92141707e-03, -1.48691752e-03,
-1.15137891e-01, -2.03848965e-02, 6.68589598e-03,
4.27398246e-02, -2.70244979e-02, -2.94143273e-02,
-1.10058848e-02, -2.69311574e-03, -8.61353193e-03,
-5.08546863e-03, 1.87142739e-02, -3.08712647e-02,
-1.54200979e-03, 6.55512272e-04, -1.43545128e-02,
-1.58612958e-02, 2.16229522e-03, -5.19502902e-03,
-8.50397154e-03, -1.02127181e-02, -3.48628246e-03,
-1.10614873e-02, -2.63751325e-03, -4.60772522e-03,
-9.09127534e-03, -2.55070218e-02, 1.18080212e-02,
-3.92720048e-02, 2.55730043e-02, -5.71125507e-03,
-7.98774548e-03, -5.41848834e-03, -4.90158189e-03,
1.14226445e-03, -4.52119478e-03]])
#Perform predictions
logmodel.predict(X_test)
array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)
# Score of predictions
logmodel.score(X_test,y_Test)
0.844625
This initial data set shows a prediction of 84.4%
- Reducing the dataset
We can use correlation and heatmap to figure out the most important variables for regression
df.corr()
| Population | Children | Age | Income | Outage_sec_perweek | Contacts | Yearly_equip_failure | Tenure | MonthlyCharge | ... | StreamingTV_No | StreamingTV_Yes | StreamingMovies_No | StreamingMovies_Yes | PaperlessBilling_No | PaperlessBilling_Yes | PaymentMethod_Bank Transfer(automatic) | PaymentMethod_Credit Card (automatic) | PaymentMethod_Electronic Check | PaymentMethod_Mailed Check | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Population | 1.000000 | -0.005877 | 0.010538 | -0.008639 | 0.005483 | 0.017962 | 0.004019 | -0.004483 | -0.003559 | -0.004778 | ... | 0.006590 | -0.006590 | 0.005882 | -0.005882 | -0.008656 | 0.008656 | -0.001749 | 0.011399 | -0.013628 | 0.006077 |
| Children | -0.005877 | 1.000000 | -0.029732 | 0.009942 | 0.001889 | 0.004479 | -0.020776 | 0.007321 | -0.005091 | -0.009781 | ... | 0.007106 | -0.007106 | -0.003461 | 0.003461 | -0.006828 | 0.006828 | -0.005761 | 0.005771 | 0.003441 | -0.003750 |
| Age | 0.010538 | -0.029732 | 1.000000 | -0.004091 | -0.008047 | 0.001588 | 0.015068 | 0.008577 | 0.016979 | 0.010729 | ... | 0.000197 | -0.000197 | -0.010125 | 0.010125 | -0.002247 | 0.002247 | 0.020689 | -0.000372 | 0.000588 | -0.020795 |
| Income | -0.008639 | 0.009942 | -0.004091 | 1.000000 | -0.010011 | -0.009267 | 0.001233 | 0.005423 | 0.002114 | -0.003014 | ... | 0.003324 | -0.003324 | 0.002186 | -0.002186 | 0.012049 | -0.012049 | -0.013111 | -0.005798 | 0.003597 | 0.014535 |
| Outage_sec_perweek | 0.005483 | 0.001889 | -0.008047 | -0.010011 | 1.000000 | 0.003994 | 0.015092 | 0.002909 | 0.002932 | 0.020496 | ... | -0.008952 | 0.008952 | -0.018086 | 0.018086 | 0.011208 | -0.011208 | -0.007203 | 0.011445 | 0.004995 | -0.009557 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| PaperlessBilling_Yes | 0.008656 | 0.006828 | 0.002247 | -0.012049 | -0.011208 | -0.012632 | -0.002649 | 0.016603 | -0.000655 | 0.001348 | ... | 0.016353 | -0.016353 | -0.000692 | 0.000692 | -1.000000 | 1.000000 | 0.005811 | 0.001891 | 0.003988 | -0.012078 |
| PaymentMethod_Bank Transfer(automatic) | -0.001749 | -0.005761 | 0.020689 | -0.013111 | -0.007203 | -0.008311 | 0.021861 | -0.009500 | -0.014228 | 0.003266 | ... | -0.004482 | 0.004482 | 0.013449 | -0.013449 | -0.005811 | 0.005811 | 1.000000 | -0.274714 | -0.384229 | -0.291882 |
| PaymentMethod_Credit Card (automatic) | 0.011399 | 0.005771 | -0.000372 | -0.005798 | 0.011445 | 0.017390 | 0.003508 | 0.007344 | 0.011513 | 0.011922 | ... | -0.004575 | 0.004575 | -0.013997 | 0.013997 | -0.001891 | 0.001891 | -0.274714 | 1.000000 | -0.367992 | -0.279547 |
| PaymentMethod_Electronic Check | -0.013628 | 0.003441 | 0.000588 | 0.003597 | 0.004995 | -0.009167 | -0.004548 | 0.016798 | -0.002885 | -0.010610 | ... | 0.007548 | -0.007548 | -0.001004 | 0.001004 | -0.003988 | 0.003988 | -0.384229 | -0.367992 | 1.000000 | -0.390989 |
| PaymentMethod_Mailed Check | 0.006077 | -0.003750 | -0.020795 | 0.014535 | -0.009557 | 0.001759 | -0.019917 | -0.016624 | 0.006218 | -0.002798 | ... | 0.000353 | -0.000353 | 0.001338 | -0.001338 | 0.012078 | -0.012078 | -0.291882 | -0.279547 | -0.390989 | 1.000000 |
63 rows × 63 columns
plt.figure(figsize = (20,20))
sns.heatmap(df.corr(),annot = True)
<AxesSubplot:>
- Reduced regression model
We are finding the values that correlate with Churn_Yes. These values are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, Bandwidth_GB_Year.
# Setting features
X_features = df[['Contract_Month-to-month','Tenure','StreamingTV_Yes','StreamingMovies_Yes','MonthlyCharge','Bandwidth_GB_Year']]
y_features = df['Churn_Yes']
# Training and testing data
X_train, X_test, y_train, y_Test = train_test_split(X_features, y_features, test_size=0.80, random_state=234)
# Setting model
logmodel = LogisticRegression(solver='liblinear', random_state=0)
# Fitting data
logmodel.fit(X_train, y_train)
LogisticRegression(random_state=0, solver='liblinear')
# Model coefficents
logmodel.coef_
array([[ 2.54696051e+00, -2.58610401e-01, 1.28195044e+00,
1.81243169e+00, 1.09267865e-02, 2.01241639e-03]])
#Perform predictions
logmodel.predict(X_test)
array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)
# Score of predictions
logmodel.score(X_test,y_Test)
0.89075
This reduced data set shows a prediction of 89%. This is about a 5% increase over the initial data set.
- Explain data analysis process
The variable selection technique was using correlation and heatmaps to figure out the variables that correlate the most with churn.
The model evaluation metrics are shown by the prediction score. The reduced data set has a 5% higher prediction than the initial data set.
Different coefficiants have different affects on the results of churn. Some have a higher correlation with churn and some have a lower correlation with churn. We found that values that correlate with Churn_Yes. These values are Contract_Month-to-month, Tenure, StreamingTV_Yes, StreamingMovies_Yes, MonthlyCharge, Bandwidth_GB_Year. Other variables have a lower correlation with churn or a higher correlation with NO churn.
The limitations of this data analysis is only running one type of model on the dataset, which can cause inacurate results. The dataset is also 10,000 rows, which is a lot, but not nearly enough to encompass all of the people that could be affected by an analysis like this. There could be so many more people and data that just were not included in this data set that could change the results.
A course of action a company could take based on these results is trying to get customers to sign up for other contract lengths rather than month-to-month, as this contract length has a high correlation with churn.